import umap
import umap.plot
import pandas as pd
import plotly.express as px
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
from nltk.tokenize import sent_tokenize
from utils import *
[nltk_data] Downloading package punkt to [nltk_data] C:\Users\addison\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date!
news_data = pd.read_csv("news_data.csv")
# Verify that we loaded the correct dataset
news_data.head()
|   | altid | title | content |
|---|---|---|---|
| 0 | sa1a70ab8ef5 | Davenport hits out at Wimbledon | World number one Lindsay Davenport has critic... |
| 1 | ta497aea0e36 | Camera phones are 'must-haves' | Four times more mobiles with cameras in them ... |
| 2 | ta0f0fa26a93 | US top of supercomputing charts | The US has pushed Japan off the top of the su... |
| 3 | ba23aaa4f4bb | Trial begins of Spain's top banker | The trial of Emilio Botin, the chairman of Sp... |
| 4 | baa126aeb946 | Safety alert as GM recalls cars | The world's biggest carmaker General Motors (... |
To cluster the news, we first have to convert its text content into a numerical representation, i.e. text embeddings. There are several approaches we could take to create the text embeddings (e.g. term frequency, TF-IDF, Word2Vec, GloVe, Doc2Vec, Universal Sentence Encoder, the BERT family of models, etc.).
For this exercise, we will use a pre-trained model from the BERT family to create the text embeddings for the news, because these models capture the semantic and contextual meaning of the text as well as the word order. In particular, we will use the pre-trained all-MiniLM-L6-v2 model from the SentenceTransformers library. The choice of model is somewhat arbitrary; any model from here should deliver decent results as long as it was trained on English text.
The BERT family of models, however, has a maximum sequence length of 512 tokens due to computational and memory constraints. Our chosen model was trained on data with a maximum sequence length of 128 tokens; it performs best when fed inputs of no more than 128 tokens.
Let's check the news dataset to see whether any of the news content violates the 128-token constraint. If so, we may need additional processing steps to abide by it.
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# Tokenize the news content and calculate the number of tokens per news
news_data["num_tokens"] = news_data.content.apply(tokenizer.tokenize).apply(len)
Token indices sequence length is longer than the specified maximum sequence length for this model (716 > 512). Running this sequence through the model will result in indexing errors
# Visualize the distribution of the number of tokens in each article
fig = px.box(news_data, y="num_tokens",
points="all",
labels={"num_tokens": "Number of Tokens"},
hover_data=["title"])
fig.show()
It can be observed that even the shortest news content, "Dementieva prevails in Hong Kong", has 200 tokens, which exceeds the maximum sequence length of 128. By default, if we pass an input with more than 128 tokens into the model, the model truncates the sequence and makes an inference based only on the first 128 tokens. This results in information loss, especially for longer news content.
One way to resolve this is to split the news content into sentences and obtain text embeddings for each sentence; clustering would then also be done at the sentence level. Sentences are generally shorter and hence less likely to violate the 128-token constraint.
# Split content by sentences
news_data["sentence"] = news_data.content.apply(sent_tokenize)
# Store each sentence in its own row
news_data_sentence = news_data[["title", "sentence"]].explode(column = "sentence", ignore_index = True)
# Calculate the number of tokens in each sentence
news_data_sentence["num_tokens"] = news_data_sentence.sentence.apply(tokenizer.tokenize).apply(len)
Let's check whether any of the sentences has more than 128 tokens.
# Visualize the distribution of the number of tokens in each sentence
fig = px.box(news_data_sentence, y="num_tokens",
points="all",
labels={"num_tokens": "Number of Tokens in Sentence"},
hover_data=["title", "sentence"])
fig.show()
With the exception of one outlier (131 tokens), all the sentences have fewer than 128 tokens. Even for the outlier, the information loss due to truncation is minimal: only 5 tokens are truncated. (Although the maximum sequence length is 128 tokens, we have to allocate 2 token slots for the [CLS] and [SEP] tokens, leaving space for only 126 content tokens; hence the number of truncated tokens = 131 - 126 = 5.)
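The special-token accounting above can be spelled out in a short snippet (the constants come straight from the discussion above):

```python
# Usable token budget once the two special tokens are reserved
MAX_SEQ_LEN = 128      # model's maximum sequence length
SPECIAL_TOKENS = 2     # [CLS] at the start, [SEP] at the end
budget = MAX_SEQ_LEN - SPECIAL_TOKENS     # 126 content tokens

outlier_len = 131                         # longest sentence in the dataset
truncated = max(0, outlier_len - budget)  # 5 tokens lost to truncation
```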
# Visualize the top 50 terms in the news
fig = px.bar(top_n_terms(news_data_sentence.sentence),
x="score", y="term",
orientation="h", height=1000,
title="<b>Top 50 Terms in the News by TF-IDF Scores</b>",
labels={"term": "Term", "score": "TF-IDF Score"})
fig.show()
The top 50 terms in the news are wide-ranging, suggesting that the news covers many different topics.
# Load embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Get text embeddings
embeddings = embedding_model.encode(news_data_sentence.sentence.tolist(),
                                    convert_to_numpy=True, normalize_embeddings=True)
# Perform dimensionality reduction on the text embeddings (i.e. reduce to 2 dimensions)
mapper = umap.UMAP(n_neighbors = 15, n_components = 2, metric = "cosine", random_state = SEED).fit(embeddings)
# Visualize the text embeddings in 2D space
# Enable inline plot
umap.plot.output_notebook()
# Create interactive plot
p = umap.plot.interactive(mapper, labels=news_data_sentence.title,
hover_data=news_data_sentence.title.to_frame(), point_size=4,
theme = "fire")
umap.plot.show(p)
From the plot above, it can be observed that there are at least 5 dense clusters in the embedding space. Some of the dense clusters contain sentences from different news titles, suggesting the presence of common themes among news content with different titles.
There are several clustering algorithms (e.g. KMeans) that we could use to group the sentences. For this exercise, however, we will use a modified version of the Top2Vec algorithm; we use a different approach from the original Top2Vec for identifying topic words in the topic interpretation step.
The Top2Vec algorithm was chosen over other clustering algorithms due to the benefits it confers. At a high level, it works as follows:
1. Create document embeddings using a pre-trained Sentence Transformer model. Documents with similar semantic meanings are placed close together in the embedding space.
2. Apply UMAP to compress the document embeddings into lower dimensions. Dimensionality reduction reduces the sparsity of the embeddings, which helps in finding dense clusters.
3. Find dense clusters of documents using HDBSCAN.
# Initialize the topic model
topic_model = Top2Vec()
# Fit the topic model to our news sentences
topic_model.fit(news_data_sentence.sentence)
2021-12-05 23:26:31,997 - top2vec - INFO - Loading all-MiniLM-L6-v2 model. 2021-12-05 23:26:48,671 - top2vec - INFO - Loaded all-MiniLM-L6-v2 model successfully. 2021-12-05 23:26:48,687 - top2vec - INFO - Obtaining document embeddings. 2021-12-05 23:26:56,578 - top2vec - INFO - Creating lower dimension document embeddings. 2021-12-05 23:26:59,753 - top2vec - INFO - Finding dense areas of documents. 2021-12-05 23:26:59,776 - top2vec - INFO - Finding topics.
The intertopic distance map below gives an overview of the topics identified in the news dataset.
# Plot intertopic distance map for an overview of all the topics
topic_model.get_topics_info()
# Perform a left join to join the clustering results to its news titles
topic_sizes = topic_model.get_results().merge(news_data_sentence, how = "left",
left_on = "document", right_on = "sentence")
# Visualize the topic sizes
fig = px.histogram(topic_sizes, color="title",
x="topic",
category_orders=dict(topic=topic_sizes.topic.unique()),
title="<b>Topic Sizes</b>",
labels={"topic": "Topic"})
fig.update_xaxes(type='category')
fig.show()
print("\033[1m" + "Double click on a title to zoom in on that title." + "\033[0m")
Double click on a title to zoom in on that title.
12 clusters/topics were identified from the news dataset. Each cluster/topic contains more than one news title, confirming our earlier conjecture that some of the news items share common themes. It can also be observed that some news items have content spanning several clusters/topics. For instance, "Huge rush for Jet Airways shares" spans 3 clusters/topics (topics 2, 3 and 4) and "£1.8m indecency fine for Viacom" spans 5 (topics 1, 3, 5, 7 and 8).
We identify the top n words in each topic using c-TF-IDF scores. This works as follows: for each topic, we join all the documents from that topic into a single document (so each topic yields one joined document), then calculate TF-IDF scores over these joined documents.
For each topic, we interpret the topic by looking at its top n words and the top n sentences most representative of the topic (computed based on cosine distance from the topic vector).
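Retrieving the most representative sentences can be sketched as below, assuming L2-normalised sentence embeddings and topic vector (so a dot product equals cosine similarity, and ranking by similarity is equivalent to ranking by cosine distance):

```python
import numpy as np

def top_sentences(topic_vector, sentence_embeddings, sentences, n=10):
    """Rank sentences by cosine similarity to the topic vector and
    return the n closest ones (smallest cosine distance)."""
    sims = sentence_embeddings @ topic_vector  # cosine similarity per sentence
    order = np.argsort(-sims)[:n]              # most similar first
    return [sentences[i] for i in order]
```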
For this exercise, we will only analyze the top 3 largest topics (i.e. topic 0, 1 and 2).
# Interpret topic 0
topic_model.get_topic_info(0)
Top 10 Sentences:
"We got our goals early and in the minds of some players the job was done but then they got a goal and perhaps made us a bit nervous."
"We missed Serge badly against Scotland.
"But I'm delighted with the match result.
"In the second half we were better but it was frustrating because we got the goal - but one slip and they were back in it."
"Jose had results before he came to Chelsea and I think he will have an impact in the Premiership because he manages his team very cleverly."
"He struck the ball very well - he always has done - and I think it was the power and pace that beat the goalkeeper."
But Nigel Williams said: "I'm satisfied the game was handled correctly."
Goalscorer Jimmy Floyd Hasselbaink added: "It wasn't a particularly beautiful match to watch - but they made it difficult for us.
"Rafa is a good coach and a good man.
"But so is going back to Portugal - I'll be playing against some lads I played with at Boavista."
# Interpret topic 1
topic_model.get_topic_info(1)
Top 10 Sentences:
Mr Blair told MPs and peers: "I know from everyone here, in Cabinet and government, nothing is going to get in the way of a unified Labour Party with a unified position and winning the third term people desperately need."
Mr Blair used the tactic before the Iraq war to try to show he really was engaging with public concerns and you can expect to see it much more in the run-up to the election.
Mr Blair was speaking to MPs amid fresh rumours of a rift with Gordon Brown.
He told BBC News: "Those who co-operate or inspire these books, in my view, have to know that, whatever the short-term political or personal advantage that they think they might secure, they always do it by damaging the record, the unity and the re-election chances of the Labour Party and the government."
On Monday, Mr Blair's spokesman said: "The prime minister is determined that he will get on with the business of government because he believes that what people want."
But the Tories insisted they would hold ministers to account over the precise purpose of the scheme.
The current economic climate meant Britain could not afford the "reckless, George Bush-style tax cutting spree" planned by the Tories, he said.
He also attacked the government's "failure" to control immigration and asylum and criticised its record on the NHS, telling delegates Labour cannot be trusted on education or crime.
He attacked Tory plans to process asylum claims abroad - but Mr Howard said Labour had proposed the idea too.
The Tories have already accused the prime minister and his chancellor of behaving like "schoolboys squabbling in a playground".
# Interpret topic 2
topic_model.get_topic_info(2)
Top 10 Sentences:
"However, should the business outlook start to deteriorate, the Bank should stand ready to cut rates."
It had warned in September that the weakening US dollar, which has cut the value of foreign sales, would knock 125m euros off its operating profits.
"Dismal reports from the retail trade about Christmas sales are worrying, if they indicate a more general weakening in consumer spending."
They could even overtake men as the main buyers by 2007, if current rates persist, according to the research.
"The cost is enormous, and continues to be paid, and will not be reversed by any restructuring."
"Three years ago, every sector [of the economy] was hit by the crisis," said entrepreneur Drayton Valentine.
Excluding the car sector, US retail sales were up 0.6% in January, twice what some analysts had been expecting.
"It is widely accepted that, if house prices start falling more sharply, the risks facing the economy will worsen considerably."
With imports rising a similar amount, the deficit rose to $43.4bn.
Rising interest rates and the accompanying slowdown in the housing market have knocked consumers' optimism, causing a sharp fall in demand for expensive goods, according to a report earlier this week from the British Retail Consortium.
We can predict the topic of a new news article using the pipeline described below. The pipeline assumes that we already have a fitted topic model with its topic vectors computed.
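The `predict` method belongs to the modified Top2Vec model in `utils`, whose code is not shown. A minimal sketch of the assignment step it performs, assuming normalised embeddings and an illustrative similarity threshold below which a document is flagged as an outlier:

```python
import numpy as np

def predict_topic(doc_embedding, topic_vectors, threshold=0.3):
    """Assign a new document to the topic whose vector is most similar;
    below the similarity threshold, flag it as an outlier (topic -1).
    Embeddings and topic vectors are assumed L2-normalised, and the
    threshold value here is a hypothetical choice for illustration."""
    sims = topic_vectors @ doc_embedding   # cosine similarity to each topic
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1
```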
Let's run some predictions using our fitted topic model.
Expected Output:
1. The first news item is about animals, a topic not present in our topic model; the model should classify it as an outlier with topic -1.
2. The second news item is about politics; the model should classify it under topic 1.
3. The third news item is about sports; the model should classify it under topic 0.
# Sample news
new_news = [
"Pet cloning sounds novel but has been around for over a decade. People in favour of the practice say some pets are 'one in a million', but opponents cite animal welfare and other issues.",
"The conflicting accounts of what happened after Ms Raeesah Khan lied in Parliament about a sexual assault case have opened a can of worms that reveal divisions in the Workers' Party (WP), said political analysts, adding that the matter has raised questions about the party's credibility.",
"Ballon d'Or winner Lionel Messi said he is grateful to be named among the greatest players in world football, but the Argentina forward stressed he does not attach too much importance to individual glory."
]
# Get predictions
topic_model.predict(new_news)
| document | topic | |
|---|---|---|
| 0 | Pet cloning sounds novel but has been around f... | -1 |
| 1 | The conflicting accounts of what happened afte... | 1 |
| 2 | Ballon d'Or winner Lionel Messi said he is gra... | 0 |
Trending news topics evolve rapidly over time, so the usefulness of our fitted topic model is time-sensitive. We need to refit the topic model with more up-to-date news data once there is a significant shift in the trending news topics.
As an extension to this exercise, we can also consider the temporal aspect of the news. By grouping the news based on their time periods, we can analyze how the trending news topics evolve with time.